{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# MAGIC Gamma Telescope - TPOT Classification Study"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The below gives information about the data set:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The data are MC generated (see below) to simulate registration of high energy gamma particles in a ground-based atmospheric Cherenkov gamma telescope using the imaging technique. Cherenkov gamma telescope observes high energy gamma rays, taking advantage of the radiation emitted by charged particles produced inside the electromagnetic showers initiated by the gammas, and developing in the atmosphere. This Cherenkov radiation (of visible to UV wavelengths) leaks through the atmosphere and gets recorded in the detector, allowing reconstruction of the shower parameters. The available information consists of pulses left by the incoming Cherenkov photons on the photomultiplier tubes, arranged in a plane, the camera. Depending on the energy of the primary gamma, a total of few hundreds to some 10000 Cherenkov photons get collected, in patterns (called the shower image), allowing to discriminate statistically those caused by primary gammas (signal) from the images of hadronic showers initiated by cosmic rays in the upper atmosphere (background). \n",
    "\n",
    "Typically, the image of a shower after some pre-processing is an elongated cluster. Its long axis is oriented towards the camera center if the shower axis is parallel to the telescope's optical axis, i.e. if the telescope axis is directed towards a point source. A principal component analysis is performed in the camera plane, which results in a correlation axis and defines an ellipse. If the depositions were distributed as a bivariate Gaussian, this would be an equidensity ellipse. The characteristic parameters of this ellipse (often called Hillas parameters) are among the image parameters that can be used for discrimination. The energy depositions are typically asymmetric along the major axis, and this asymmetry can also be used in discrimination. There are, in addition, further discriminating characteristics, like the extent of the cluster in the image plane, or the total sum of depositions. \n",
    "\n",
    "https://archive.ics.uci.edu/ml/datasets/MAGIC+Gamma+Telescope\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Attribute Information:\n",
    "\n",
    "1. fLength: continuous # major axis of ellipse [mm] \n",
    "2. fWidth: continuous # minor axis of ellipse [mm] \n",
    "3. fSize: continuous # 10-log of sum of content of all pixels [in #phot] \n",
    "4. fConc: continuous # ratio of sum of two highest pixels over fSize [ratio] \n",
    "5. fConc1: continuous # ratio of highest pixel over fSize [ratio] \n",
    "6. fAsym: continuous # distance from highest pixel to center, projected onto major axis [mm] \n",
    "7. fM3Long: continuous # 3rd root of third moment along major axis [mm] \n",
    "8. fM3Trans: continuous # 3rd root of third moment along minor axis [mm] \n",
    "9. fAlpha: continuous # angle of major axis with vector to origin [deg] \n",
    "10. fDist: continuous # distance from origin to center of ellipse [mm] \n",
    "11. class: g,h # gamma (signal), hadron (background) \n",
    "\n",
    "g = gamma (signal): 12332 \n",
    "h = hadron (background): 6688 \n",
    "\n",
    "For technical reasons, the number of h events is underestimated. In the real data, the h class represents the majority of the events. \n",
    "\n",
    "The simple classification accuracy is not meaningful for this data, since classifying a background event as signal is worse than classifying a signal event as background. For comparison of different classifiers an ROC curve has to be used. The relevant points on this curve are those, where the probability of accepting a background event as signal is below one of the following thresholds: 0.01, 0.02, 0.05, 0.1, 0.2 depending on the required quality of the sample of the accepted events for different experiments."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Import required libraries\n",
    "from tpot import TPOT\n",
    "from sklearn.cross_validation import train_test_split\n",
    "import pandas as pd \n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Flength</th>\n",
       "      <th>Fwidth</th>\n",
       "      <th>Fsize</th>\n",
       "      <th>Fconc</th>\n",
       "      <th>Fconc1</th>\n",
       "      <th>Fasym</th>\n",
       "      <th>Fm3long</th>\n",
       "      <th>Fm3trans</th>\n",
       "      <th>Falpha</th>\n",
       "      <th>Fdist</th>\n",
       "      <th>Class</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>28.7967</td>\n",
       "      <td>16.0021</td>\n",
       "      <td>2.6449</td>\n",
       "      <td>0.3918</td>\n",
       "      <td>0.1982</td>\n",
       "      <td>27.7004</td>\n",
       "      <td>22.0110</td>\n",
       "      <td>-8.2027</td>\n",
       "      <td>40.0920</td>\n",
       "      <td>81.8828</td>\n",
       "      <td>g</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>31.6036</td>\n",
       "      <td>11.7235</td>\n",
       "      <td>2.5185</td>\n",
       "      <td>0.5303</td>\n",
       "      <td>0.3773</td>\n",
       "      <td>26.2722</td>\n",
       "      <td>23.8238</td>\n",
       "      <td>-9.9574</td>\n",
       "      <td>6.3609</td>\n",
       "      <td>205.2610</td>\n",
       "      <td>g</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>162.0520</td>\n",
       "      <td>136.0310</td>\n",
       "      <td>4.0612</td>\n",
       "      <td>0.0374</td>\n",
       "      <td>0.0187</td>\n",
       "      <td>116.7410</td>\n",
       "      <td>-64.8580</td>\n",
       "      <td>-45.2160</td>\n",
       "      <td>76.9600</td>\n",
       "      <td>256.7880</td>\n",
       "      <td>g</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>23.8172</td>\n",
       "      <td>9.5728</td>\n",
       "      <td>2.3385</td>\n",
       "      <td>0.6147</td>\n",
       "      <td>0.3922</td>\n",
       "      <td>27.2107</td>\n",
       "      <td>-6.4633</td>\n",
       "      <td>-7.1513</td>\n",
       "      <td>10.4490</td>\n",
       "      <td>116.7370</td>\n",
       "      <td>g</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>75.1362</td>\n",
       "      <td>30.9205</td>\n",
       "      <td>3.1611</td>\n",
       "      <td>0.3168</td>\n",
       "      <td>0.1832</td>\n",
       "      <td>-5.5277</td>\n",
       "      <td>28.5525</td>\n",
       "      <td>21.8393</td>\n",
       "      <td>4.6480</td>\n",
       "      <td>356.4620</td>\n",
       "      <td>g</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    Flength    Fwidth   Fsize   Fconc  Fconc1     Fasym  Fm3long  Fm3trans  \\\n",
       "0   28.7967   16.0021  2.6449  0.3918  0.1982   27.7004  22.0110   -8.2027   \n",
       "1   31.6036   11.7235  2.5185  0.5303  0.3773   26.2722  23.8238   -9.9574   \n",
       "2  162.0520  136.0310  4.0612  0.0374  0.0187  116.7410 -64.8580  -45.2160   \n",
       "3   23.8172    9.5728  2.3385  0.6147  0.3922   27.2107  -6.4633   -7.1513   \n",
       "4   75.1362   30.9205  3.1611  0.3168  0.1832   -5.5277  28.5525   21.8393   \n",
       "\n",
       "    Falpha     Fdist Class  \n",
       "0  40.0920   81.8828     g  \n",
       "1   6.3609  205.2610     g  \n",
       "2  76.9600  256.7880     g  \n",
       "3  10.4490  116.7370     g  \n",
       "4   4.6480  356.4620     g  "
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#Load the data\n",
    "telescope=pd.read_csv('MAGIC Gamma Telescope Data.csv')\n",
    "telescope.head(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As can be seen in the above the class object is organized here, and hence for better results, I start with randomly shuffling the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Flength</th>\n",
       "      <th>Fwidth</th>\n",
       "      <th>Fsize</th>\n",
       "      <th>Fconc</th>\n",
       "      <th>Fconc1</th>\n",
       "      <th>Fasym</th>\n",
       "      <th>Fm3long</th>\n",
       "      <th>Fm3trans</th>\n",
       "      <th>Falpha</th>\n",
       "      <th>Fdist</th>\n",
       "      <th>Class</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>6497</th>\n",
       "      <td>24.6124</td>\n",
       "      <td>14.6224</td>\n",
       "      <td>2.3086</td>\n",
       "      <td>0.5012</td>\n",
       "      <td>0.2826</td>\n",
       "      <td>9.1300</td>\n",
       "      <td>14.1010</td>\n",
       "      <td>-13.0228</td>\n",
       "      <td>4.6460</td>\n",
       "      <td>225.7730</td>\n",
       "      <td>g</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18126</th>\n",
       "      <td>144.7300</td>\n",
       "      <td>43.9751</td>\n",
       "      <td>3.4122</td>\n",
       "      <td>0.1033</td>\n",
       "      <td>0.0571</td>\n",
       "      <td>11.6485</td>\n",
       "      <td>-147.6160</td>\n",
       "      <td>33.0563</td>\n",
       "      <td>2.4463</td>\n",
       "      <td>208.0640</td>\n",
       "      <td>h</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4699</th>\n",
       "      <td>27.7069</td>\n",
       "      <td>9.7056</td>\n",
       "      <td>2.4023</td>\n",
       "      <td>0.4792</td>\n",
       "      <td>0.2594</td>\n",
       "      <td>-39.0670</td>\n",
       "      <td>15.3697</td>\n",
       "      <td>-6.6979</td>\n",
       "      <td>32.8160</td>\n",
       "      <td>154.3420</td>\n",
       "      <td>g</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3291</th>\n",
       "      <td>45.0159</td>\n",
       "      <td>20.7843</td>\n",
       "      <td>2.9297</td>\n",
       "      <td>0.2434</td>\n",
       "      <td>0.1311</td>\n",
       "      <td>36.9017</td>\n",
       "      <td>-15.5396</td>\n",
       "      <td>10.2351</td>\n",
       "      <td>10.5580</td>\n",
       "      <td>158.9890</td>\n",
       "      <td>g</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16378</th>\n",
       "      <td>64.8094</td>\n",
       "      <td>30.1755</td>\n",
       "      <td>3.2292</td>\n",
       "      <td>0.3136</td>\n",
       "      <td>0.2251</td>\n",
       "      <td>-76.8601</td>\n",
       "      <td>-38.6044</td>\n",
       "      <td>-13.6838</td>\n",
       "      <td>32.7431</td>\n",
       "      <td>278.4021</td>\n",
       "      <td>h</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        Flength   Fwidth   Fsize   Fconc  Fconc1    Fasym   Fm3long  Fm3trans  \\\n",
       "6497    24.6124  14.6224  2.3086  0.5012  0.2826   9.1300   14.1010  -13.0228   \n",
       "18126  144.7300  43.9751  3.4122  0.1033  0.0571  11.6485 -147.6160   33.0563   \n",
       "4699    27.7069   9.7056  2.4023  0.4792  0.2594 -39.0670   15.3697   -6.6979   \n",
       "3291    45.0159  20.7843  2.9297  0.2434  0.1311  36.9017  -15.5396   10.2351   \n",
       "16378   64.8094  30.1755  3.2292  0.3136  0.2251 -76.8601  -38.6044  -13.6838   \n",
       "\n",
       "        Falpha     Fdist Class  \n",
       "6497    4.6460  225.7730     g  \n",
       "18126   2.4463  208.0640     h  \n",
       "4699   32.8160  154.3420     g  \n",
       "3291   10.5580  158.9890     g  \n",
       "16378  32.7431  278.4021     h  "
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "telescope_shuffle=telescope.iloc[np.random.permutation(len(telescope))]\n",
    "telescope_shuffle.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Above also rearranges the index number and below I basically reset the Index Numbers."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Flength</th>\n",
       "      <th>Fwidth</th>\n",
       "      <th>Fsize</th>\n",
       "      <th>Fconc</th>\n",
       "      <th>Fconc1</th>\n",
       "      <th>Fasym</th>\n",
       "      <th>Fm3long</th>\n",
       "      <th>Fm3trans</th>\n",
       "      <th>Falpha</th>\n",
       "      <th>Fdist</th>\n",
       "      <th>Class</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>24.6124</td>\n",
       "      <td>14.6224</td>\n",
       "      <td>2.3086</td>\n",
       "      <td>0.5012</td>\n",
       "      <td>0.2826</td>\n",
       "      <td>9.1300</td>\n",
       "      <td>14.1010</td>\n",
       "      <td>-13.0228</td>\n",
       "      <td>4.6460</td>\n",
       "      <td>225.7730</td>\n",
       "      <td>g</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>144.7300</td>\n",
       "      <td>43.9751</td>\n",
       "      <td>3.4122</td>\n",
       "      <td>0.1033</td>\n",
       "      <td>0.0571</td>\n",
       "      <td>11.6485</td>\n",
       "      <td>-147.6160</td>\n",
       "      <td>33.0563</td>\n",
       "      <td>2.4463</td>\n",
       "      <td>208.0640</td>\n",
       "      <td>h</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>27.7069</td>\n",
       "      <td>9.7056</td>\n",
       "      <td>2.4023</td>\n",
       "      <td>0.4792</td>\n",
       "      <td>0.2594</td>\n",
       "      <td>-39.0670</td>\n",
       "      <td>15.3697</td>\n",
       "      <td>-6.6979</td>\n",
       "      <td>32.8160</td>\n",
       "      <td>154.3420</td>\n",
       "      <td>g</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>45.0159</td>\n",
       "      <td>20.7843</td>\n",
       "      <td>2.9297</td>\n",
       "      <td>0.2434</td>\n",
       "      <td>0.1311</td>\n",
       "      <td>36.9017</td>\n",
       "      <td>-15.5396</td>\n",
       "      <td>10.2351</td>\n",
       "      <td>10.5580</td>\n",
       "      <td>158.9890</td>\n",
       "      <td>g</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>64.8094</td>\n",
       "      <td>30.1755</td>\n",
       "      <td>3.2292</td>\n",
       "      <td>0.3136</td>\n",
       "      <td>0.2251</td>\n",
       "      <td>-76.8601</td>\n",
       "      <td>-38.6044</td>\n",
       "      <td>-13.6838</td>\n",
       "      <td>32.7431</td>\n",
       "      <td>278.4021</td>\n",
       "      <td>h</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    Flength   Fwidth   Fsize   Fconc  Fconc1    Fasym   Fm3long  Fm3trans  \\\n",
       "0   24.6124  14.6224  2.3086  0.5012  0.2826   9.1300   14.1010  -13.0228   \n",
       "1  144.7300  43.9751  3.4122  0.1033  0.0571  11.6485 -147.6160   33.0563   \n",
       "2   27.7069   9.7056  2.4023  0.4792  0.2594 -39.0670   15.3697   -6.6979   \n",
       "3   45.0159  20.7843  2.9297  0.2434  0.1311  36.9017  -15.5396   10.2351   \n",
       "4   64.8094  30.1755  3.2292  0.3136  0.2251 -76.8601  -38.6044  -13.6838   \n",
       "\n",
       "    Falpha     Fdist Class  \n",
       "0   4.6460  225.7730     g  \n",
       "1   2.4463  208.0640     h  \n",
       "2  32.8160  154.3420     g  \n",
       "3  10.5580  158.9890     g  \n",
       "4  32.7431  278.4021     h  "
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tele=telescope_shuffle.reset_index(drop=True)\n",
    "tele.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Pre-Processing Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Flength     float64\n",
       "Fwidth      float64\n",
       "Fsize       float64\n",
       "Fconc       float64\n",
       "Fconc1      float64\n",
       "Fasym       float64\n",
       "Fm3long     float64\n",
       "Fm3trans    float64\n",
       "Falpha      float64\n",
       "Fdist       float64\n",
       "Class        object\n",
       "dtype: object"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Check the Data Type\n",
    "tele.dtypes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Class is a Categorical Variable as can be seen from above. The levels in it are:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Levels for catgeory 'Class': ['g' 'h']\n"
     ]
    }
   ],
   "source": [
    "for cat in ['Class']:\n",
    "    print(\"Levels for catgeory '{0}': {1}\".format(cat, tele[cat].unique()))   "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, the categorical variable is numerically encoded since TPOT for now cannot handle categorical variables. 'g' is coded as 0 and 'h' as 1."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "tele['Class']=tele['Class'].map({'g':0,'h':1})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A check is performed to see if there are any missing values and code then accordingly."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Flength     False\n",
       "Fwidth      False\n",
       "Fsize       False\n",
       "Fconc       False\n",
       "Fconc1      False\n",
       "Fasym       False\n",
       "Fm3long     False\n",
       "Fm3trans    False\n",
       "Falpha      False\n",
       "Fdist       False\n",
       "Class       False\n",
       "dtype: bool"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tele = tele.fillna(-999)\n",
    "pd.isnull(tele).any()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(19020, 11)"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tele.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally we store the class labels, which we need to predict, in a separate variable."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "tele_class = tele['Class'].values"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Data Analysis using TPOT"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To begin our analysis, we need to divide our training data into training and validation sets. The validation set is just to give us an idea of the test set error."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(14265, 4755)"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "training_indices, validation_indices = training_indices, testing_indices = train_test_split(tele.index, stratify = tele_class, train_size=0.75, test_size=0.25)\n",
    "training_indices.size, validation_indices.size"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After that, we proceed to calling the fit, score and export functions on our training dataset. An important TPOT parameter to set is the number of generations. Here we have set it to 5. On a standard laptop with 4GB RAM, it roughly takes 5 minutes per generation to run. For each added generation, it should take 5 mins more. Thus, for the default value of 100, total run time could be roughly around 8 hours."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": []
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Generation 1 - Current best internal CV score: 0.84037\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": []
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Generation 2 - Current best internal CV score: 0.84233\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": []
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Generation 3 - Current best internal CV score: 0.84359\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": []
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Generation 4 - Current best internal CV score: 0.84359\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": []
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Generation 5 - Current best internal CV score: 0.84573\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": []
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Best pipeline: _extra_trees(_gradient_boosting(input_df, 0.48999999999999999, 16.0, 0.089999999999999997), 26, 27.0, 0.33000000000000002)\n"
     ]
    }
   ],
   "source": [
    "tpot = TPOT(generations=5, verbosity=2)\n",
    "tpot.fit(tele.drop('Class',axis=1).loc[training_indices].values, tele.loc[training_indices,'Class'].values)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the above, 5 generations were computed, each giving the training efficiency of fitting model on the training set. As evident, the best pipeline is the one that has the CV score of 84.573%. The model that produces this result is one that fits extra trees on gradient boosting algorithm. Next, the testing error is computed for validation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.84425705404075746"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tpot.score(tele.drop('Class',axis=1).loc[validation_indices].values, tele.loc[validation_indices, 'Class'].values)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "As can be seen, the testing accuracy is 84.428%."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "tpot.export('tpot_MAGIC_Gamma_Telescope_pipeline.py')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "Below is the code that got generated. As can be seen gradient boosting did the best here. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# %load tpot_MAGIC_Gamma_Telescope_pipeline.py\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "from sklearn.cross_validation import train_test_split\n",
    "from sklearn.ensemble import GradientBoostingClassifier\n",
    "\n",
    "# NOTE: Make sure that the class is labeled 'class' in the data file\n",
    "tpot_data = pd.read_csv('PATH/TO/DATA/FILE', delimiter='COLUMN_SEPARATOR')\n",
    "training_indices, testing_indices = train_test_split(tpot_data.index, stratify = tpot_data['class'].values, train_size=0.75, test_size=0.25)\n",
    "\n",
    "result1 = tpot_data.copy()\n",
    "\n",
    "# Perform classification with a gradient boosting classifier\n",
    "gbc1 = GradientBoostingClassifier(learning_rate=0.49, max_features=1.0, min_weight_fraction_leaf=0.09, n_estimators=500, random_state=42)\n",
    "gbc1.fit(result1.loc[training_indices].drop('class', axis=1).values, result1.loc[training_indices, 'class'].values)\n",
    "\n",
    "result1['gbc1-classification'] = gbc1.predict(result1.drop('class', axis=1).values)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
